Reading Assignment 4

Introduction to CUDA Programming

Write your answers in a PDF and upload the document on Gradescope for submission. The due date is given on Gradescope. Each question is worth 10 points.

Please watch videos 19 through 24 and review the slides before answering these questions.

Slide Deck

Describe three features that differentiate CPUs from GPUs.
What is the double-precision performance of a Quadro RTX 6000 compared to its single-precision performance?
Assume you launch a CUDA kernel from the CPU code. When the function call returns on the CPU, does it mean that the CUDA kernel execution has completed on the GPU?
What is an NVIDIA tensor core?
How many SMs are required to run a CUDA thread block? Does the answer depend on the number of threads in the block?

Slide Deck

Explain the difference between sbatch and srun in SLURM.
What is the SLURM command to cancel a job?
Explain the meaning of the keywords __global__ and __device__ in CUDA.
Explain what the following built-in CUDA variables are: threadIdx, blockDim, blockIdx.
Starter code. Read the program firstProgram.cu. Then fill in the TODOs in R4.cu (contained in the zip file) so that you compute an array of type float with entries:
```
out[i] = 1. / i;
```
Please also read addMatrices.cu, which contains useful examples. The array should have 100,000 entries. Each CUDA thread should compute a single entry out[i]. Use 512 threads per CUDA block.
Explain the difference between a virtual architecture and a real architecture in nvcc.
What are the recommended nvcc options to compile CUDA code on icme-gpu?
Explain what the shorthand option --gpu-architecture=sm_75 does during the compilation process using nvcc.
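
Hint for the starter-code question above: since 100,000 is not a multiple of 512, the number of blocks has to be rounded up. The sketch below shows only this host-side arithmetic. The helper names blocksNeeded and idleThreads are hypothetical and do not appear in R4.cu; the code is plain C++ and should also compile unchanged inside a .cu file.

```cpp
#include <cassert>

// Hypothetical helper (not part of R4.cu): number of blocks needed so that
// n entries are covered when each thread handles exactly one entry.
int blocksNeeded(int n, int threadsPerBlock) {
    assert(n > 0 && threadsPerBlock > 0);
    // Integer ceiling division: rounds up when n is not a multiple
    // of threadsPerBlock.
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: threads launched beyond the end of the array.
// These need to be masked out in the kernel with a bounds check on
// the global index.
int idleThreads(int n, int threadsPerBlock) {
    return blocksNeeded(n, threadsPerBlock) * threadsPerBlock - n;
}
```

With 100,000 entries and 512 threads per block this gives 196 blocks, that is 100,352 threads in total, of which 352 fall past the end of the array.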